From Predictive Models to Instructional Policies 


Joseph Rollinson 
Computer Science Department 
Carnegie Mellon University 
jtrollinson@gmail.com 


Emma Brunskill 
Computer Science Department 
Carnegie Mellon University 
ebrun@cs.cmu.edu 


ABSTRACT 

At their core, Intelligent Tutoring Systems consist of a stu- 
dent model and a policy. The student model captures the 
state of the student and the policy uses the student model 
to individualize instruction. Policies require different prop- 
erties from the student model. For example, a mastery 
threshold policy requires the student model to have a way to 
quantify whether the student has mastered a skill. A large 
amount of work has been done on building student models 
that can predict student performance on the next question. 
In this paper, we leverage this prior work with a new when- 
to-stop policy that is compatible with any such predictive 
student model. Our results suggest that, when employed as 
part of our new predictive similarity policy, student mod- 
els with similar predictive accuracies can suggest that sub- 
stantially different amounts of practice are necessary. This 
suggests that predictive accuracy may not be a sufficient 
metric by itself when choosing which student model to use 
in intelligent tutoring systems. 

1. INTRODUCTION 

Intelligent tutoring systems offer the promise of highly ef- 
fective, personalized, scalable education. Within the ITS 
research community, there has been substantial work on con- 
structing student models that can accurately predict student 
performance (e.g. [6][3][15][5][10][9][14][7]). Another key issue
is how to improve student performance through the use
of instructional policy design. There has been significant 
interest in cognitive models used for within activity design 
(often referred to as the inner-loop) and even authoring tools 
developed to make designing effective activities easier (e.g. 
CTAT [1]). However, there has been much less attention to
outer-loop (what problem to select or when to stop) instructional
policies (though exceptions include [5] [12] [17]).

In this paper we focus on a common outer-loop ITS chal- 
lenge, adaptively deciding when to stop teaching a certain 
skill to a student given correct/incorrect responses. Somewhat
surprisingly, there are no standard policy rules or algorithms
for deciding when to stop teaching for many of the
student models introduced over the last decade. Bayesian 
Knowledge Tracing [6] naturally lends itself to mastery teach- 
ing, since one can halt when the student has mastered a skill 
with probability above a certain threshold. Such a mastery 
threshold has been used as part of widely used tutoring sys- 
tems, but typically in conjunction with additional rules since 
a student may never reach a sufficient mastery threshold 
given the available activities. 

We seek to be able to directly use a wide range of student 
models to create instructional policies that halt both when a 
student has learned a skill and when the student seems un- 
likely to make any further progress given the available tutor- 
ing activities. To do so we introduce an instructional policy 
rule based on change in predicted student performance. 

Our specific contributions are as follows: 

• We provide a functional interface to student models 
that captures their predictive powers without knowledge
of their internal mechanics (Section 3).

• We introduce the predictive similarity policy, a new 
when-to-stop policy that can take as input any predictive
student model (Section 4) and can halt either when
students have successfully acquired a skill or when they do not
seem able to do so given the available activities.

• We analyze the performance of this policy compared 
to a mastery threshold policy on the KDD dataset and 
find that our policy tends to suggest a similar or smaller
number of problems than a mastery threshold policy
(Section 5).

• We also show that our new policy can be used to ana- 
lyze a range of student models with similar predictive 
performance (on the KDD dataset) and find that they 
can sometimes suggest very different numbers of instructional
problems (Section 5).

Our results suggest that predictive accuracy alone can mask 
some of the substantial differences among student models. 
Policies based on models with similar predictive accuracy can
make widely different decisions. One direction for future 
work is to measure which models produce the best learning 
policies. This will require new experiments and datasets. 


Figure 1: BKT as a Markov process. Mastery and
Non-Mastery are hidden states. Arrow values repre- 
sent the probability of the transition or observation. 

2. BACKGROUND: STUDENT MODELS 

Student models are responsible for modeling the learning 
process of students. The majority of student models are 
predictive models that provide probabilistic predictions of 
whether a student will get a subsequent item correct. In this 
section we describe two popular predictive student models, 
Bayesian knowledge tracing and latent factor models. Note 
that other predictive models, such as Predictive State Representations
(PSRs), can also be used to calculate the probability
of a correct response.

2.1 Bayesian Knowledge Tracing 

Bayesian Knowledge Tracing (BKT) [6] tracks the state of
the student’s knowledge as they respond to practice ques- 
tions. BKT treats students as being in one of two possible 
hidden states: Mastery and Non-Mastery. It is assumed 
that a student never forgets what they have mastered and if 
not yet mastered, a new question always has a fixed static 
probability of helping the student master the skill. These 
assumptions mean that BKT requires only four trained pa- 
rameters: 

$P(L_0)$ Initial probability of mastery.

$P(T)$ Probability of transitioning to mastery over a
single learning opportunity.

$P(G)$ Probability of guessing the correct answer when
the student is not in the mastered state.

$P(S)$ Probability of slipping (making a mistake) when
the student is in the mastered state.

After every response, the probability of mastery, $P(L_t)$, is
updated with Bayesian inference. The probability that a 
student responds correctly is 

$$P_{BKT}(C_t) = (1 - P(S))\,P(L_t) + P(G)\,(1 - P(L_t)). \tag{1}$$
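
As a rough illustration of equation 1 and the Bayesian update mentioned above, the following Python sketch conditions the mastery estimate on one response and then applies the learning transition. The function names and the numeric parameter values are assumptions made for this example, and the update shown is the standard BKT posterior rather than code taken from the paper.

```python
def bkt_predict_correct(p_mastery, p_guess, p_slip):
    """Equation 1: probability of a correct response given P(L_t)."""
    return (1.0 - p_slip) * p_mastery + p_guess * (1.0 - p_mastery)


def bkt_update(p_mastery, correct, p_transit, p_guess, p_slip):
    """Return P(L_{t+1}) after observing one response.

    Step 1 conditions P(L_t) on the observation with Bayes' rule.
    Step 2 applies the learning transition P(T).
    Assumes the observed response has non-zero probability.
    """
    p_correct = bkt_predict_correct(p_mastery, p_guess, p_slip)
    if correct:
        posterior = (1.0 - p_slip) * p_mastery / p_correct
    else:
        posterior = p_slip * p_mastery / (1.0 - p_correct)
    return posterior + (1.0 - posterior) * p_transit


# Hypothetical parameter values, chosen only for illustration.
p_l0, p_t, p_g, p_s = 0.3, 0.2, 0.2, 0.1

p_l = p_l0
for response in (True, True, False, True):
    p_l = bkt_update(p_l, response, p_t, p_g, p_s)
```

Under these assumed values a correct response raises the mastery estimate and an incorrect one lowers the Bayesian posterior before the transition step is applied.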

Prior work suggests that students can get stuck on a par- 
ticular activity. Unfortunately, BKT as described above as- 
sumes that students will inevitably master a skill if given 
enough questions. As this is not always the case, in indus- 
try BKT is often used together with additional rules to make 
instructional decisions. 

2.2 Latent Factor Models 

Unlike BKT models, Latent Factor Models (LFM) do not di- 
rectly model learning as a process [3] . Instead, LFMs assume 
that there are latent parameters of both the student and skill 
that can be used to predict student performance. These pa- 
rameters are learned from a dataset of students answering 
questions on multiple skills. The probability that the student
responds correctly to the next question is calculated by
applying the sigmoid function to the linear combination of
parameters $p$ and features $f$.

$$P_{LFM}(C) = \frac{1}{1 + e^{-p \cdot f}} \tag{2}$$

Additive Factor Models (AFM) [3] are based on the assumption
that student performance increases with more questions.
A student is represented by an aptitude parameter
($\alpha_i$) and a skill is represented by a difficulty parameter ($\beta_k$)
and learning rate ($\gamma_k$). AFM is sensitive to the number of
questions the student has seen, but ignores the correctness 
of student responses. The probability that student i will 
respond correctly after n responses on skill k is 

$$P_{AFM}(C) = \frac{1}{1 + e^{-(\alpha_i + \beta_k + \gamma_k n)}}. \tag{3}$$

Performance Factor Models (PFM) [15] are an extension of
AFMs that are sensitive to the correctness of student responses.
PFMs separate the skill learning rate into success
and failure parameters, $\mu_k$ and $\rho_k$ respectively. The probability
that student $i$ will respond correctly after $s$ correct
responses and $f$ incorrect responses on skill $k$ is

$$P_{PFM}(C) = \frac{1}{1 + e^{-(\alpha_i + \beta_k + \mu_k s + \rho_k f)}}. \tag{4}$$
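
Equations 3 and 4 differ only in which counts enter the linear term. A minimal Python sketch of both predictions follows; the function and argument names are assumptions for illustration, and the sign convention simply follows the equations as written (all terms are added inside the sigmoid).

```python
import math


def sigmoid(x):
    """Logistic function; adequate for the moderate magnitudes used here."""
    return 1.0 / (1.0 + math.exp(-x))


def afm_predict_correct(alpha_i, beta_k, gamma_k, n):
    """Equation 3: depends only on the number of responses n."""
    return sigmoid(alpha_i + beta_k + gamma_k * n)


def pfm_predict_correct(alpha_i, beta_k, mu_k, rho_k, s, f):
    """Equation 4: s correct and f incorrect responses are weighted separately."""
    return sigmoid(alpha_i + beta_k + mu_k * s + rho_k * f)


# Hypothetical parameters: the same student and skill after 3 correct, 2 incorrect.
afm_p = afm_predict_correct(alpha_i=0.1, beta_k=-0.4, gamma_k=0.2, n=5)
pfm_p = pfm_predict_correct(alpha_i=0.1, beta_k=-0.4, mu_k=0.3, rho_k=-0.1, s=3, f=2)
```

When the success and failure weights are equal, the PFM prediction reduces to the AFM prediction with n = s + f, which is one way to see that AFM ignores correctness.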

LFMs can easily be extended to capture other features. For 
example, the instructional factors model extends PFMs
with a parameter for the number of tells (interactions that
do not generate observations) given to the student. To our
knowledge there is almost no work on using LFMs to cap- 
ture temporal information about the order of observations. 
Unlike BKT, LFMs are not frequently used in instructional 
policies. 

Though structurally different, BKT models, AFMs and PFMs 
tend to have similar predictive accuracy [9][15]. This raises
the interesting question of whether instructional policies that 
use these models are similar. 

3. WHEN-TO-STOP POLICIES 

We assume a simple intelligent tutoring system that teaches 
students one skill at a time. All questions are treated the 
same, so the system only has to decide when to stop pro- 
viding the student questions. In this section, we provide a 
general framework for the when-to-stop problem. In particular,
we describe an interface that abstracts the student
model away from instructional policies. We use this interface to define
the mastery threshold policy here and, in the next section,
as the foundation of a model-agnostic instructional policy.

3.1 Accessing Models 

Policies require a mechanism for getting values from stu- 
dent models to make decisions. We describe this mecha- 
nism as a state type and a set of functions. A student 
model consists of two types of values: immutable param- 
eters that are learned on training data and mutable state 
that changes over time. For example, the parameters for 
BKT are ($P(L_0)$, $P(T)$, $P(G)$, $P(S)$) and the model state is
the probability of mastery ($P(L_t)$). Policies treat the state



Figure 2: Model process with functional interface 


as a black box, which they pass to functions. All predictive
student models must provide the following functions.
startState(...) returns the model state given that the student
has not seen any questions. updateState(state, obs)
returns an updated state given the observation. For this pa- 
per, observations are whether the student got the last ques- 
tion correct or incorrect. Finally, predictive student mod- 
els must provide predictCorrect(state), which returns the 
probability that the student will get the next question correct.
The function interfaces for BKT models and PFM
are provided in Table 1. Under this abstraction, when-to-stop
policies are functions stop(state) that output true if
the system should stop providing questions for the current 
skill and false if the system should continue providing the 
student with questions. 
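
One possible rendering of this interface in Python is sketched below, using a PFM as the example model. The class and method names are assumptions (snake_case versions of startState, updateState and predictCorrect), and, unlike Table 1, the sketch keeps the trained weights on the model object and only the counts in the state, which is a design choice made here to separate immutable parameters from mutable state.

```python
import math
from dataclasses import dataclass, replace
from typing import Any, Protocol


class PredictiveStudentModel(Protocol):
    """Functions every predictive student model is assumed to expose."""

    def start_state(self) -> Any: ...
    def update_state(self, state: Any, correct: bool) -> Any: ...
    def predict_correct(self, state: Any) -> float: ...


@dataclass(frozen=True)
class PFMState:
    successes: int = 0
    failures: int = 0


@dataclass(frozen=True)
class PFM:
    w: float    # alpha_i + beta_k, fixed after training
    mu: float   # weight on prior correct responses
    rho: float  # weight on prior incorrect responses

    def start_state(self) -> PFMState:
        return PFMState()

    def update_state(self, state: PFMState, correct: bool) -> PFMState:
        # States are immutable: return a new state rather than mutating.
        if correct:
            return replace(state, successes=state.successes + 1)
        return replace(state, failures=state.failures + 1)

    def predict_correct(self, state: PFMState) -> float:
        z = self.w + self.mu * state.successes + self.rho * state.failures
        return 1.0 / (1.0 + math.exp(-z))


def predict_after(model: PredictiveStudentModel, responses) -> float:
    """Policy-side code: uses the state only through the three functions."""
    state = model.start_state()
    for correct in responses:
        state = model.update_state(state, correct)
    return model.predict_correct(state)


# Hypothetical weights, for illustration only.
model = PFM(w=0.0, mu=0.5, rho=-0.5)
p_after_two_correct = predict_after(model, [True, True])
```

Because `predict_after` never inspects the state, any other model that provides the same three functions could be passed in unchanged.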

3.2 Mastery Threshold Policy 

The mastery threshold policy halts when the student
model is confident that the student has mastered the skill.
That is, we want to halt once the student masters
the skill. Note that if the estimate of student mastery is
based solely on a BKT model (see footnote 1), then a mastery threshold policy
implicitly assumes that every student will master the skill
given enough problems. Mathematically, we want to stop at
time $t$ if $P(L_t) > \Delta$, where $\Delta$ is our mastery threshold. The
mastery threshold policy function can be written as:

$$\mathrm{stop}_M(\mathrm{state}) = \mathrm{predictMastery}(\mathrm{state}) > \Delta. \tag{5}$$

The mastery threshold policy can only be used with mod- 
els that include predictMastery(state) in their function 
set. BKT models are compatible, but LFMs are not. By
itself, the mastery threshold policy does not stop if the student
has no chance of attaining mastery in the skill with the given
activities. Students working on poorly designed skills could be stuck
practicing a skill indefinitely.
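
A small Python sketch may make both the policy and its failure mode concrete. The helper names and the toy update rule below are assumptions for illustration only; the toy rule is not BKT, it simply moves the mastery estimate halfway towards 1 on a correct response and leaves it unchanged otherwise.

```python
def make_mastery_threshold_stop(predict_mastery, threshold=0.95):
    """Build stop(state) for any model that exposes predictMastery."""
    def stop(state):
        return predict_mastery(state) > threshold
    return stop


def questions_until_stop(stop, update_state, state, responses):
    """Count the responses consumed before the policy halts.

    Returns None if the policy has still not halted after all responses.
    """
    for asked, correct in enumerate(responses):
        if stop(state):
            return asked
        state = update_state(state, correct)
    return len(responses) if stop(state) else None


# Toy model (hypothetical): the state is the mastery estimate itself.
def toy_update(p_mastery, correct):
    return p_mastery + (1.0 - p_mastery) * 0.5 if correct else p_mastery


stop = make_mastery_threshold_stop(lambda p_mastery: p_mastery)
learner = questions_until_stop(stop, toy_update, 0.5, [True] * 10)
stuck = questions_until_stop(stop, toy_update, 0.5, [False] * 10)
```

In this toy setting the student who keeps answering correctly is released after four questions, while the student who never answers correctly is never released, which mirrors the indefinite practice described above.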

4. FROM PREDICTION TO POLICY 

In educational data mining, a large emphasis is put on build- 
ing models that can accurately predict student observations. 
Our goal is to build a new when-to-stop policy that will work 
with any predictive student model. 

1 In practice, industry systems that use mastery thresholds
and BKT often use additional rules as well.

Our new instructional policy is based on a set of assumptions.
First, students working on a skill will eventually end
in one of two hidden end-states: either they will master
the skill, or they will be unable to master the skill given
the activities available. Second, once students enter either 
end-state, the probability that they respond correctly to a 
question stays the same. Third, if the probability that a stu- 
dent will respond correctly is not changing, then the student 
is in an end-state. Finally, we should stop if the student is 
in an end-state. 

From these assumptions it follows that if the probability 
that the student will respond correctly to the next question 
is not changing, then we should stop. In other words, we 
should stop if it is highly likely that showing the student 
another question will not change the probability that the 
student will get the next question correct by a significant 
amount. We propose to stop if 

$$P\big(|P(C_t) - P(C_{t+1})| < \epsilon\big) > \delta \tag{6}$$

where $P(C_t)$ is the probability that the student will get the
next question right. This can be thought of as a threshold on 
the sum of the probabilities of each observation that will lead 
to an insignificant change in the probability that a student 
will get the next question correct, which can be written as 

$$\sum_{o \in O} P(O_t = o)\,\mathbb{1}\big(|P(C_t) - P(C_{t+1} \mid O_t = o)| < \epsilon\big) > \delta \tag{7}$$

where $P(C_{t+1} \mid O_t = o)$ is the probability that the student
will respond correctly after observation $o$, $O_t$ is the observation
at time $t$, and $\mathbb{1}$ is an indicator function. In our case
$O = \{C, \neg C\}$. This expression is true in the following cases:

1. $P(C_t) > \delta$ and $|P(C_t) - P(C_{t+1} \mid C_t)| < \epsilon$

2. $P(\neg C_t) > \delta$ and $|P(C_t) - P(C_{t+1} \mid \neg C_t)| < \epsilon$

3. $|P(C_t) - P(C_{t+1} \mid C_t)| < \epsilon$ and
   $|P(C_t) - P(C_{t+1} \mid \neg C_t)| < \epsilon$

First, if a student is highly likely to respond correctly to the 
next question and the change in prediction is small if the 
student responds correctly, then we should stop. Second, 
if a student is highly unlikely to respond correctly to the 
next question and the change in prediction is small if the 
student responds incorrectly, then we should stop. Third, if 
the change in prediction is small no matter how the student 
responds, then we should stop. All terms in these expres- 
sions can be calculated from the predictive student model 
interface as shown in equations 8 and 9. We call the instructional
policy that stops according to these three cases
the predictive similarity policy. The function for the
predictive similarity policy is provided in Algorithm 1.

$$P(C_t) = \mathrm{predictCorrect}(s) \tag{8}$$

$$P(C_{t+1} \mid O_t) = \mathrm{predictCorrect}(\mathrm{updateState}(s, O_t)) \tag{9}$$
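
The stopping rule built from equations 8 and 9 can be sketched in Python as follows. The factory function, its argument names, and the toy model used to exercise it are assumptions for illustration; only the two interface functions and the thresholds come from the text.

```python
def make_predictive_similarity_stop(update_state, predict_correct,
                                    epsilon=0.01, delta=0.95):
    """Build stop(state) from any predictive student model.

    The policy sums the probability of each observation that would change
    the next prediction by less than epsilon, and stops if that sum
    exceeds delta.
    """
    def stop(state):
        p_correct = predict_correct(state)
        total = 0.0
        if p_correct > 0.0:
            p_next = predict_correct(update_state(state, True))
            if abs(p_correct - p_next) < epsilon:
                total += p_correct
        if p_correct < 1.0:
            p_next = predict_correct(update_state(state, False))
            if abs(p_correct - p_next) < epsilon:
                total += 1.0 - p_correct
        return total > delta
    return stop


# Toy model (hypothetical): the state is the number of questions seen, and
# the prediction rises towards 1 regardless of correctness.
def toy_update(n, correct):
    return n + 1


def toy_predict(n):
    return 1.0 - 0.5 ** (n + 1)


stop = make_predictive_similarity_stop(toy_update, toy_predict)
```

With this toy model the policy continues at the start, where one more question moves the prediction from 0.5 to 0.75, and stops after ten questions, where either response would change the prediction by far less than 0.01.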

5. EXPERIMENTS & RESULTS 

We now compare the predictive similarity policy to the 
mastery threshold policy and see if using different stu- 
dent models as input to the predictive similarity policy 
yields quantitatively different policies. 


Table 1: Functional interfaces for BKT and PFM

Function            BKT                                         PFM
startState(...)     $P(L_0)$                                    $(\alpha_i + \beta_k, \mu_k, \rho_k, 0, 0)$
updateState(s, o)   $P(L_{t+1} \mid P(L_t), O_{t+1} = o)$       $(w, \mu, \rho, s + 1, f)$ if $o = C$;
                                                                $(w, \mu, \rho, s, f + 1)$ if $o = \neg C$
predictCorrect(s)   $P(\neg S)\,P(L_t) + P(G)\,(1 - P(L_t))$    $1 / (1 + e^{-(w + \mu s + \rho f)})$
predictMastery(s)   $P(L_t)$                                    (none)


Algorithm 1 Predictive similarity policy stop function
function STOP(state)
    P(C_t) ← predictCorrect(state)
    total ← 0
    if P(C_t) > 0 then
        state' ← updateState(state, correct)
        P(C_{t+1} | C_t) ← predictCorrect(state')
        if |P(C_t) − P(C_{t+1} | C_t)| < ε then
            total ← total + P(C_t)
    if P(C_t) < 1 then
        state' ← updateState(state, incorrect)
        P(C_{t+1} | ¬C_t) ← predictCorrect(state')
        if |P(C_t) − P(C_{t+1} | ¬C_t)| < ε then
            total ← total + (1 − P(C_t))
    return total > δ


5.1 ExpOps 

In order to better understand the differences between two 
instructional policies we will measure the expected num- 
ber of problems to be given to students by a policy using 
the ExpOps algorithm. The ExpOps algorithm allows us to 
summarize an instructional policy into a single number by 
approximately calculating the expected number of questions 
an instructional policy would provide to a student. A naive 
algorithm takes in the state of the student model and re- 
turns 0 if the instructional policy stops at the current state 
or recursively calls itself with an updated state given each 
possible observation as shown in equation 10. It builds a
synthetic tree of possible observations and their probabil- 
ity using the model state. The tree grows until the policy 
decides to stop teaching the student. This approach does 
not require any student data nor does it generate any obser- 
vation sequences. However, this algorithm may never stop, 
so ExpOps approximates it by also stopping if we reach a 
maximum length or if the probability of the sequence of ob- 
servations thus far drops below a path threshold as shown in 
Algorithm 2. In this paper, we use a path threshold of 1CP'
and a maximum length of 100. 

$$E[\mathrm{Ops}] = \begin{cases} 0 & \text{if } \mathrm{stop}(s) \\ 1 + \sum_{o \in O} P(O_t = o)\,E[\mathrm{Ops} \mid o] & \text{otherwise} \end{cases} \tag{10}$$

Lee and Brunskill first introduced this metric to show that 
individualized models lead to significantly different policies 
than the general models [12].
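
The recursion of equation 10, together with the two cut-offs described above, can be sketched in Python as below. The function signature and the toy policy are assumptions for illustration; the path threshold passed in the usage example is a hypothetical value rather than the one used in the paper, while the maximum length of 100 follows the text.

```python
def exp_ops(start_state, stop, update_state, predict_correct,
            *, path_threshold, max_len):
    """Approximate the expected number of questions a policy would give.

    Expands the tree of possible response sequences and prunes a branch
    when its probability falls below path_threshold or its length
    exceeds max_len.
    """
    def recurse(state, p_path, length):
        if p_path < path_threshold or length > max_len or stop(state):
            return 0.0
        p_correct = predict_correct(state)
        p_wrong = 1.0 - p_correct
        expected = 0.0
        if p_correct > 0.0:
            next_state = update_state(state, True)
            expected += p_correct * recurse(next_state, p_path * p_correct, length + 1)
        if p_wrong > 0.0:
            next_state = update_state(state, False)
            expected += p_wrong * recurse(next_state, p_path * p_wrong, length + 1)
        return 1.0 + expected

    return recurse(start_state, 1.0, 0)


# Toy check (hypothetical model): the state counts questions, every response
# is a coin flip, and the policy stops after three questions.
toy_result = exp_ops(
    0,
    stop=lambda n: n >= 3,
    update_state=lambda n, correct: n + 1,
    predict_correct=lambda n: 0.5,
    path_threshold=1e-6,  # assumed value for illustration
    max_len=100,
)
```

Since every path in the toy example stops after exactly three questions, the expected value is 3, which is a convenient sanity check before plugging in a real model and policy.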

Algorithm 2 Expected Number of Learning Opportunities
function EXPOPS(startState)
    function EXPOPS'(state, P(path), len)
        if P(path) < pathThreshold then
            return 0
        if len > maxLen then
            return 0
        if stop(state) then
            return 0
        P(C) ← predictCorrect(state)
        P(W) ← 1 − P(C)
        expOpsSoFar ← 0
        if P(C) > 0 then
            P(path + c) ← P(path) * P(C)
            state' ← updateState(state, correct)
            ops ← EXPOPS'(state', P(path + c), len + 1)
            expOpsSoFar ← expOpsSoFar + (ops * P(C))
        if P(W) > 0 then
            P(path + w) ← P(path) * P(W)
            state' ← updateState(state, incorrect)
            ops ← EXPOPS'(state', P(path + w), len + 1)
            expOpsSoFar ← expOpsSoFar + (ops * P(W))
        return 1 + expOpsSoFar
    return EXPOPS'(startState, 1, 0)

5.2 Data


For our experiments we used the Algebra I 2008-2009
dataset from the KDD Cup 2010 Educational Data Mining
Challenge [18]. This dataset was collected from students
learning Algebra I using Carnegie Learning Inc.’s intelligent
tutoring systems. The dataset consists of 8,918,054
rows where each row corresponds to a single step inside a 
problem. These steps are tagged according to three differ- 
ent knowledge component models. For this paper, we used 
the SubSkills knowledge component model. We removed all 
rows with missing data. We combined the rows into obser- 
vation sequences per student and per skill. Steps attached 
to multiple skills were added to the observation sequences of 
all attached skills. We removed all skills that had fewer than
50 observation sequences. Our final dataset included 3292
students, 505 skills, and 421,991 observation sequences. 
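
The grouping and filtering steps above can be sketched in Python as follows. The row format, a tuple of student id, list of skill names and correctness flag, is an assumption made for this example and does not reflect the actual column layout of the KDD Cup files.

```python
from collections import defaultdict


def build_sequences(rows, min_sequences=50):
    """Group time-ordered rows into one observation sequence per student and skill.

    rows: iterable of (student_id, skill_names, correct) tuples.
    A step tagged with several skills is added to every attached skill.
    Skills with fewer than min_sequences sequences are dropped.
    """
    sequences = defaultdict(list)
    for student, skills, correct in rows:
        for skill in skills:
            sequences[(skill, student)].append(correct)

    per_skill = defaultdict(dict)
    for (skill, student), observations in sequences.items():
        per_skill[skill][student] = observations

    return {
        skill: by_student
        for skill, by_student in per_skill.items()
        if len(by_student) >= min_sequences
    }


# Tiny hypothetical dataset; the threshold is lowered so the example keeps data.
rows = [
    ("s1", ["add"], True),
    ("s1", ["add", "sub"], False),
    ("s2", ["add"], True),
    ("s2", ["sub"], True),
]
kept = build_sequences(rows, min_sequences=2)
```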

We performed 5-fold cross-validation on the dataset to see
how well AFM, PFM, and BKT models predict student performance.
We randomly separated the dataset into five folds
with an equal number of observation sequences per skill in
each fold. We trained AFM, PFM, and BKT models on four
of the five folds and then predicted student performance on
the leftover fold. The resulting root mean squared errors
are reported in Table 2. Our results show that the three models
had similar predictive accuracy, agreeing with prior work.

Table 2: Root Mean Squared Error on 5 Folds

Fold    BKT      PFM      AFM
0       0.353    0.364    0.368
1       0.359    0.367    0.371
2       0.358    0.368    0.371
3       0.366    0.369    0.374
4       0.353    0.365    0.368
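
For readers who want to reproduce this kind of evaluation, the sketch below shows one way to assign sequences to folds so that each skill is spread evenly, and how the root mean squared error is computed. The function names, the round-robin assignment, and the fixed seed are assumptions for illustration and not necessarily how the folds in the paper were built.

```python
import math
import random


def assign_folds(sequence_ids_by_skill, k=5, seed=0):
    """Map (skill, sequence_id) to a fold index, balancing each skill.

    Sequences of a skill are shuffled and dealt round-robin, so fold sizes
    for that skill differ by at most one.
    """
    rng = random.Random(seed)
    folds = {}
    for skill, ids in sequence_ids_by_skill.items():
        shuffled = list(ids)
        rng.shuffle(shuffled)
        for position, sequence_id in enumerate(shuffled):
            folds[(skill, sequence_id)] = position % k
    return folds


def rmse(predictions, outcomes):
    """Root mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(predictions) != len(outcomes) or not predictions:
        raise ValueError("predictions and outcomes must be non-empty and equal length")
    squared = [(p - o) ** 2 for p, o in zip(predictions, outcomes)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical usage: ten sequences of one skill, and a constant predictor.
folds = assign_folds({"skill_a": range(10)})
constant_error = rmse([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 1])
```

A model that always predicts 0.5 has an RMSE of exactly 0.5, which gives a rough baseline against which the values in Table 2 can be read.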

5.3 Model Implementation 

We implemented BKT models as hidden Markov models using
a Python package we developed. We used the Baum-Welch
algorithm to train the models, stopping when the
change in log-likelihood between iterations fell below $10^{-5}$.
For each skill, 10 models with random starting parameters
were trained, and the one with the highest likelihood was
picked. Both AFM and PFM were implemented using scikit-learn’s
logistic regression classifier [16]. We used L1 regularization
and included a fit intercept. The tolerance was
$10^{-4}$. We treated an observation connected to multiple skills
as multiple observations, one per skill. It is also popular to
treat them as a single observation with multiple skill parameters.
In the interest of reproducibility, we have published
the models used as a Python package (see footnote 2).

5.4 Experiment 1: Comparing policies 

The mastery threshold policy is frequently used as a key 
part of deciding when to stop showing students questions. 
However, without additional rules, it does not stop if students
cannot learn the skill from the current activities. In this 
experiment we compare the predictive similarity policy 
to the mastery threshold policy to see if the predictive 
similarity policy acts like the mastery threshold policy 
when students learn and stops sooner when students are 
unable to learn with the given tutoring. We based both 
policies on BKT models. 

We ran ExpOps on each skill for both policies. For the mastery
threshold policy, we used the community standard
threshold of $\Delta = 0.95$. For the predictive similarity
policy, we decided that the smallest meaningful change in
predictions is 0.01 and that our confidence should be 0.95,
so we set $\epsilon = 0.01$ and $\delta = 0.95$. We then split the skills
into those where the BKT model trained on them had semantically
meaningful parameters and the rest. A BKT
model was said to have semantically meaningful parameters
if $P(G) < 0.5$ and $P(S) < 0.5$. 218 skills had semantically
meaningful parameters and 283 did not (see footnote 3).

2 The packages are available at http://www.jrollinson.com/research/2015/edm/from-predictive-models-to-instructional-policies.html

3 We found similar results for both experiments using
BKT models trained through brute force iteration on
semantically meaningful values. These results can be found
at http://www.jrollinson.com/research/2015/edm/from-predictive-models-to-instructional-policies.html

Figure 3: ExpOps using the mastery threshold policy
and the predictive similarity policy on skills with
and without semantically meaningful parameters. 

The Pearson correlation coefficient between ExpOps values 
calculated using the two policies on skills with semantically 
meaningful parameters was 0.95. This suggests that the two 
policies make very similar decisions when based on BKT 
models with semantically meaningful parameters. However, 
the Pearson correlation coefficient between ExpOps values 
calculated using the two policies on skills that do not have 
semantically meaningful parameters was only 0.55. To un- 
cover why the correlation coefficient was so much lower on 
skills that do not have semantically meaningful parameters, 
we plotted the ExpOps values calculated with the mastery 
threshold policy on the X-axis and the ExpOps values 
calculated with the predictive similarity policy on the 
Y-axis for each skill as shown in Figure 3. This plot shows
that the predictive similarity policy tends to either agree 
with the mastery threshold policy or have a lower Ex- 
pOps value on skills with parameters that are not semanti- 
cally meaningful. This suggests that the predictive simi- 
larity policy is stopping sooner on skills that students are 
unlikely to learn. The mastery threshold policy does not give up on
these skills, and instead teaches them for a long time. 
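
The analysis in this experiment rests on two simple computations: splitting skills by whether their BKT parameters are semantically meaningful, and correlating the per-skill ExpOps values of the two policies. A minimal Python sketch follows; the function names and the sample numbers are assumptions for illustration and are not results from the paper.

```python
import math


def is_semantically_meaningful(p_guess, p_slip):
    """Criterion from the text: P(G) < 0.5 and P(S) < 0.5."""
    return p_guess < 0.5 and p_slip < 0.5


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-skill ExpOps values under the two policies.
mastery_ops = [5.0, 8.0, 12.0, 20.0]
similarity_ops = [5.5, 7.5, 12.5, 19.0]
r = pearson(mastery_ops, similarity_ops)
```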

5.5 Experiment 2: Comparing models with 
the predictive similarity policy 

The previous experiment suggests that the predictive similarity
policy can effectively mimic the good aspects of the mastery
threshold policy when based on a BKT model. We
now wish to see how using models with similar predictive accuracy
but different internal structure will affect it. LFMs
and BKT models have vastly different structures, making
them good models for this task. Our earlier results also
found that AFM, PFM, and BKT models have similar predictive
accuracy. We ran ExpOps on each skill with the predictive
similarity policy based on both AFM and PFM.
AFM and PFM require a student parameter, which we set
to the mean of their trained student parameters. This is
commonly done when modeling a student that has not been
seen before. We compared the ExpOps values for these two
models with the values for the BKT-based predictive similarity
policy calculated in the previous experiment.

Figure 4: ExpOps plots for the predictive similarity policy using BKT, AFM, and PFM.

Table 3: Correlation coefficients on ExpOps values
from policies using BKT, AFM, and PFM.

Models          Coefficient with    Coefficient with skills
                all skills          not stopped immediately
AFM vs. PFM     0.32                0.72
AFM vs. BKT     -0.06               0.44
PFM vs. BKT     0.16                0.46

We first looked at how many skills the different policies im- 
mediately stopped on. We found that the BKT-based policy 
stopped immediately on 31 (6%) of the skills, whilst PFM 
stopped immediately on 130 (26%) and AFM stopped im- 
mediately on 295 (59%). 

We calculated the correlation coefficient between each pair 
of policies on all skills as well as just on skills in which both 
policies did not stop immediately as shown in Table 3. We
found that AFM and PFM had the highest correlation co- 
efficient. For each pair of policies, we found that removing 
the immediately stopped skills had a large positive impact 
on correlation coefficient. The BKT-based policy had a cor- 
relation coefficient of 0.44 with the AFM-based policy and 
0.46 with the PFM-based policy on skills that were not im- 
mediately stopped on. This suggests that there is a weak 
correlation between LFM-based and BKT-based policies. 

We plotted the ExpOps values for each pair of policies, 
shown in Figure 4. The AFM vs. PFM plot reiterates that
the AFM-based and PFM-based policies have similar Ex- 
pOps values on skills where AFM does not stop immediately. 
The BKT vs. PFM plot shows that the PFM-based policy 
either immediately stops or has a higher ExpOps value than 
the BKT-based policy on most skills.

To understand why the PFM-based policy tends to either 
stop immediately or go on for longer than the BKT-based 
policy, we studied two skills. The first skill is ‘Plot point on 
minor tick mark — integer major fractional minor’ on which 
the BKT-based policy has an ExpOps value of 7.0 and the 
PFM-based policy has an ExpOps value of 20.7. The sec- 
ond skill is ‘Identify solution type of compound inequality 
using and’ on which the BKT-based policy has an ExpOps 
value of 11.4 and the PFM-based policy immediately stops. 
We calculated the predictions of both models on two artificial
students, one who gets every question correct and
one who gets every question incorrect. In Figure 5, we plot
the prediction trajectories to see how the predictions of the
two models compare. In both plots, the PFM's predictions
approach their asymptotes more slowly than the BKT model's. Since LFMs
calculate predictions with a logistic function, PFM predictions
asymptote to 0 when given only incorrect responses
and 1 when given only correct responses, whereas the BKT
model’s predictions asymptote to $P(G)$ and $1 - P(S)$ respectively.
In the first plot, the PFM's predictions change at a
slower rate than the BKT model's, but they
do begin to asymptote by the 20th question. In the second
plot, the PFM's predictions change much more slowly. After
25 correct responses, the PFM's prediction
changes by just over 0.1, and after 25 incorrect responses, the
PFM's prediction changes by less than 0.03.
In contrast, the BKT model's predictions asymptote over 10 questions
to $1 - P(S) = 0.79$ when given correct responses and
$P(G) = 0.47$ when given incorrect responses.

This figure also shows how the parameters of a BKT model
affect decision making. $P(L_0)$ is responsible for the initial
probability of a correct response. $P(S)$ and $P(G)$ respectively
provide the upper and lower asymptotes for the probability
of a correct response. $P(T)$ is responsible for the
speed of reaching the asymptotes. For the predictive similarity
policy, the distance between the initial probability of
a correct response and the asymptotes, along with the speed
of reaching the asymptotes, determines the number of
questions suggested.
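
Prediction trajectories of this kind are easy to regenerate. The Python sketch below traces both models for an always-correct and an always-incorrect student. The values P(G) = 0.47 and P(S) = 0.21 follow the second skill discussed above; P(L_0), P(T) and all PFM weights are hypothetical values chosen for illustration, and the BKT update is the standard posterior followed by the learning transition.

```python
import math


def bkt_trajectory(p_l0, p_t, p_g, p_s, correct, steps):
    """Predicted P(correct) before each of `steps` identical responses."""
    predictions = []
    p_l = p_l0
    for _ in range(steps):
        p_c = (1.0 - p_s) * p_l + p_g * (1.0 - p_l)
        predictions.append(p_c)
        if correct:
            posterior = (1.0 - p_s) * p_l / p_c
        else:
            posterior = p_s * p_l / (1.0 - p_c)
        p_l = posterior + (1.0 - posterior) * p_t
    return predictions


def pfm_trajectory(w, mu, rho, correct, steps):
    """Same trajectory for a PFM whose counts grow on one side only."""
    predictions = []
    for n in range(steps):
        z = w + (mu * n if correct else rho * n)
        predictions.append(1.0 / (1.0 + math.exp(-z)))
    return predictions


p_g, p_s = 0.47, 0.21          # from the text: P(G) = 0.47, 1 - P(S) = 0.79
p_l0, p_t = 0.3, 0.2           # hypothetical
bkt_correct = bkt_trajectory(p_l0, p_t, p_g, p_s, True, 50)
bkt_incorrect = bkt_trajectory(p_l0, p_t, p_g, p_s, False, 50)
pfm_correct = pfm_trajectory(0.0, 0.3, -0.3, True, 50)    # hypothetical weights
pfm_incorrect = pfm_trajectory(0.0, 0.3, -0.3, False, 50)
```

In this sketch the BKT predictions always stay between P(G) and 1 - P(S), whereas the PFM predictions drift towards 1 and 0. With a non-zero P(T), the always-incorrect BKT trajectory in this sketch levels off somewhat above P(G), because the learning transition is applied after every update.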

Figure 5: Predictions of BKT models and PFMs if
given all correct responses or all incorrect responses
on two skills.

6. DISCUSSION

Our results from experiment 1 show that the predictive
similarity policy performs similarly to the mastery
threshold policy on BKT models with semantically meaningful
parameters and suggests the same or fewer problems
on BKT models without semantically meaningful parameters.
Thus, this experiment suggests that the two instructional
policies treat students successfully learning skills similarly.
The lower ExpOps values for the predictive similarity
policy provide evidence that the predictive similarity
policy does not waste as much student time as the
mastery threshold policy on its own. Fundamentally, the 
mastery threshold policy fails to recognize that some stu- 
dents may not be ready to learn a skill. The predictive 
similarity policy does not make the same error. Instead, 
the policy stops either when the system succeeds in teaching 
the student or when the skill is unteachable by the system. 
In practice mastery threshold policies are often used in 
conjunction with other rules such as a maximum amount of 
practice before stopping. A comparison of such hybrid poli- 
cies to the predictive similarity policy is an interesting 
direction for future work. However, it is important to note 
that such hybrid policies would still require the underlying 
model to have a notion of mastery, unlike our predictive 
similarity policy. 

The predictive similarity policy can be used to uncover 
differences in predictive models. Experiment 2 shows that 
policies based on models with the same predictive power 
can have widely different actions. AFMs had a very sim- 
ilar RMSE to both PFMs and the BKT models, but im- 
mediately stopped on a majority of the skills. An AFM 
must provide the same predictions to students who get many 
questions correct and students who get many questions in- 
correct. To account for this, its predictions do not change 
much over time. One may argue that this suggests that 
AFM models are poor predictive models, because their pre- 
dictions hardly change with large differences in state. Both 
AFMs and PFMs have inaccurate asymptotes because it is 
likely that students who have mastered the skill will not get
every question correct and that students who have not mas- 
tered the skill will not get every question incorrect. This 
means that these models will attempt to stay away from 
their asymptotes with lower learning rates. One possible so- 
lution would be to build LFMs that limit the history length. 
Such a model could learn asymptotes that are not 0 and 1. 

7. RELATED WORK 

Predictive student models are a key area of interest in the 
intelligent tutoring systems and educational data mining 
community. One recent model incorporates both BKT and 
LFM into a single model with better predictive accuracy 
than both [10]. It assumes that there are many problems 
associated with a single skill, and each problem has an item 
parameter. If we were to use such a model in a when-to-stop 
policy context, the simplest approach would be to find the 
problem with the highest learning parameter for that skill, 
and repeatedly apply it. However, this reduces Khajah et 
al.’s model to a simple BKT model, which is why we did not 
explicitly compare to their approach. 

Less work has been done on the effects of student models 
on policies. Fancsali et al. [8] showed that when using the 
mastery threshold policy with BKT one can view the 
mastery threshold as a parameter controlling the frequency 
of false negatives and false positives. This work focused on 
simulated data from BKT models. Since BKT assumes that 
students eventually learn, this work did not consider wheel- 
spinning. Rafferty et al. [Tt] showed that different models of 
student learning of a cognitive matching task lead to signifi- 
cantly different partially observable Markov decision process 
policies. Unlike our work, which focuses on deciding when 
to stop teaching a single activity type, that work focused 
on how to sequence different types of activities and did not 
use a standard education domain (unlike our use of KDD 
Cup). Mandel et al. [13] did a large comparison of different 
student models in terms of their predicted influence on 
the best instructional policy and expected performance of 
that policy in the context of an educational game; however, 
like Rafferty et al. their focus was on considering how to se- 
quence different types of activities, and instead of learning 
outcomes they focused on enhancing engagement. Chi et 
al. [5] performed feature selection to create models of student 
learning designed to be part of policies that would 
enhance learning gains on a physics tutor; however, the fo- 
cus again was on selecting among different types of activities 
rather than a when-to-stop policy. Note that neither BKT 
nor LFMs in their original form can be used to select among 
different types of problems, though extensions to both can 
enable such functionality. An interesting direction of future 
work would be to see how to extend our policy to take into 
account different types of activities. 

Work on when-to-stop policies is also quite limited. Lee 
and Brunskill [12] showed that individualizing student BKT 
models has a significant impact on the expected number of 
practice opportunities (as measured through ExpOps) for a 
significant fraction of students. Koedinger et al. [11] showed 
that splitting one skill into multiple skills could significantly 
improve learning performance; this process was done by hu- 
man experts and leveraged BKT models for the policy de- 
sign. Cen et al. [4] improved the efficiency of student learning 
by noticing that AFM models suggested that some skills 
were significantly over or under practiced. They created new 
BKT parameters for such skills and the result was a new tu- 
tor that helped students learn significantly faster. However, 
the authors did not directly use AFM to induce policies, 
but rather used an expert based approach to transform the 
models back to BKT models, which could be used with ex- 
isting mastery approaches. In contrast, our approach can be 
directly used with AFM and other such models. 

Our policy assumes that learning is a gradual process. If 
you were to instead subscribe to an all-at-once method of 
learning, you could possibly use the moment of learning as 
your stopping point. Baker et al. provide a method of detecting 
the moment at which learning occurs. However, 
this work does not attempt to build instructional policies. 

8. CONCLUSION & FUTURE WORK 

The main contribution of this paper is a when-to-stop pol- 
icy with two attractive properties: it can be used with any 
predictive student model and it will provide finite practice 
both to students that succeed in learning a given skill and 
to those unable to do so given the presented activities. 

This policy allowed us for the first time to compare com- 
mon predictive models (LFMs and BKT models) in terms of 
their predicted practice required. In doing so we found that 
models with similar predictive error rates can lead to very 
different policies. This suggests that if they are to be used 
for instructional decision making, student models should not 
be judged by predictive error rates alone. One limitation of 
the current work is that only one dataset was used in the 
experiments. To confirm these results it would be useful to 
compare to other datasets. 

One key issue raised by this work is how to evaluate instruc- 
tional policy accuracy. One possible solution is to run trials 
with students stopping after different numbers of questions. 
The student would take both a pre-test and a post-test, which 
could be compared to see if the student improved. How- 
ever, such a trial would require many students and could be 
detrimental to their learning. 

There is a lot of room for extending this instructional policy. 
First, we would like to incorporate other types of interac- 
tions, such as dictated information (“tells”) or worked ex- 
amples, into the predictive similarity policy. This would 
give student models more information and hopefully lead 
to better predictions. Second, the predictive similarity 
policy is myopic, and we are interested in the effects of ex- 
panding to longer horizons. Third, we are excited about ex- 
tending this instructional policy to choosing between skills. 
Instead of stopping when there is a high probability of pre- 
dictions not changing, the instructional policy could return 
either the skill that had the highest chance of a significant 
change in prediction, or the skill with the highest expected 
change in prediction. 
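
A minimal sketch of this skill-selection extension is given below. It assumes each skill has a predictive model exposed as a function from a response history to the probability of a correct answer, and it measures the change in prediction one step ahead. The toy PFM-style predictor, the threshold `epsilon`, the skill names, and all parameter values are hypothetical and chosen only for illustration.

```python
import math
from typing import Callable, Dict, List, Tuple

Predictor = Callable[[List[bool]], float]


def make_pfm(easiness: float, success_rate: float, failure_rate: float) -> Predictor:
    """Toy PFM-style predictor used only to exercise the selection rules."""
    def predict(history: List[bool]) -> float:
        n_correct = sum(history)
        n_incorrect = len(history) - n_correct
        x = easiness + success_rate * n_correct + failure_rate * n_incorrect
        return 1.0 / (1.0 + math.exp(-x))
    return predict


def one_step_changes(predict: Predictor, history: List[bool]) -> Tuple[float, float, float]:
    """Current prediction and its absolute change after each possible response."""
    p = predict(history)
    change_if_correct = abs(predict(history + [True]) - p)
    change_if_incorrect = abs(predict(history + [False]) - p)
    return p, change_if_correct, change_if_incorrect


def prob_significant_change(predict: Predictor, history: List[bool], epsilon: float) -> float:
    """Probability, under the model, that the prediction moves by at least epsilon."""
    p, dc, di = one_step_changes(predict, history)
    return p * (dc >= epsilon) + (1 - p) * (di >= epsilon)


def expected_change(predict: Predictor, history: List[bool]) -> float:
    """Expected absolute change in prediction after one more question."""
    p, dc, di = one_step_changes(predict, history)
    return p * dc + (1 - p) * di


def choose_skill(models: Dict[str, Predictor], histories: Dict[str, List[bool]],
                 score: Callable[[Predictor, List[bool]], float]) -> str:
    """Return the skill whose model maximizes the given score."""
    return max(models, key=lambda skill: score(models[skill], histories[skill]))


models = {"fractions": make_pfm(0.0, 0.5, -0.5), "decimals": make_pfm(3.0, 0.01, -0.01)}
histories = {"fractions": [], "decimals": []}
print(choose_skill(models, histories, expected_change))
print(choose_skill(models, histories, lambda m, h: prob_significant_change(m, h, 0.01)))
```

In this toy setting both criteria select the skill whose predictions are still moving, but the two can disagree in general: the first rewards a large average shift, while the second only asks whether the shift is likely to exceed the threshold.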